
PIXART-α: A Diffusion Transformer Model for Text-to-Image Generation

This article provides a short tutorial on how to run experiments with Pixart-α, the new transformer-based diffusion model for generating photorealistic images from text.
The popularity of text-conditional image generation models like DALL·E 3, Midjourney, and Stable Diffusion can largely be attributed to how easily they produce stunning images from simple, meaningful text prompts. However, such models come with significant training costs (e.g., millions of GPU hours), which seriously hinder fundamental innovation in the field of AI-generated content while increasing CO2 emissions.
Pixart-α is a novel text-to-image diffusion model that takes only 10.8% of the training time of Stable Diffusion v1.5 while generating high-resolution images (up to 1024 × 1024 pixels) of a quality competitive with the aforementioned state-of-the-art image generators.
In this article, we'll explore:
  • The architecture and the training strategy for Pixart-α, especially how the researchers behind the model were able to optimize the training resources.
  • How we can easily run Pixart-α using 🤗 HuggingFace Diffusers and manage our experiments using Weights & Biases.
  • How the quality of images generated by Pixart-α compares with Stable Diffusion XL, a SoTA text-conditional image generation model.
As a note, you can run the code in this report via this Colab Notebook:


Alternatively, jump on this HuggingFace Space to start crafting your prompts using an interactive application 👇


And, since this is a GenAI report, we know what you want upfront: some stunning images to get you started.


Images generated using PixArt-α


Optimization of Text-to-Image Training

The training of advanced text-to-image models such as Stable Diffusion and DeepFloyd IF demands immense computational resources.
For instance, training Stable Diffusion v1.5 requires roughly 6,000 A100 GPU days, costing approximately $320,000. Training at this scale also produces substantial CO2 emissions, adding environmental stress.
Such a huge cost imposes significant barriers for both researchers and entrepreneurs seeking access to these models, hindering crucial advancement in the field of AI-generated content.

Comparisons of CO2 emissions and training cost among Text-to-image generation models. Pixart-α achieves an exceptionally low training cost of $26,000. Compared to RAPHAEL, the CO2 emissions and training costs for Pixart-α are merely 1.1% and 0.85%, respectively. Source: Figure 2 from the paper.

Given these challenges, the authors of the paper Pixart-α: Fast Training Of Diffusion Transformer For Photorealistic Text-to-Image Synthesis attempt to answer a simple question:
Can we develop a high-quality image generator with affordable resource consumption?
But before understanding how we can optimize this process, let's try to answer a fundamental question:

Why is Text-to-Image Training Slow?

A text-to-image generation task can be decomposed into three aspects:
  1. Capturing Pixel Dependency: Generating realistic images involves understanding intricate pixel-level dependencies within images and capturing their distribution.
  2. Alignment between Text and Image: Precise alignment learning is required for understanding how to generate images that accurately match the text description.
  3. Aesthetic Quality: Besides faithfully reflecting the textual description, being aesthetically pleasing is another vital attribute of generated images.
In the case of models like Stable Diffusion, these three problems are entangled together and the model is trained directly from scratch using vast amounts of data, resulting in inefficient training.
Another problem lies in the quality of the captions in the LAION dataset on which these models are trained. The existing text-image pairs often suffer from text-image misalignment, deficient descriptions, infrequent use of diverse vocabulary, and the inclusion of low-quality data. These problems make training difficult, requiring millions of iterations to achieve stable alignment between text and images.

An Optimized Three-stage Training Strategy

As discussed earlier, existing methods for training text-to-image models entangle the problems of capturing pixel dependency, text-image alignment, and aesthetic quality, and train directly from scratch on vast amounts of data, resulting in inefficient training. To solve this issue, the researchers behind Pixart-α disentangle these aspects into three decoupled training stages:

Stage-1: Pixel Dependency Learning

The class-guided approach (from the DiT paper) has shown exemplary performance in generating semantically coherent and reasonable pixels in individual images. Training a class-conditional image generation model for natural images is relatively easy and inexpensive. Additionally, the researchers behind Pixart-α find that a suitable initialization can significantly boost training efficiency. Therefore, Pixart-α is initialized from an ImageNet-pretrained model, and its architecture is designed to be compatible with the pre-trained weights.

Stage-2: Text-image Alignment Learning

Compared to pre-trained class-guided image generation, achieving accurate alignment for text-to-image generation is more challenging as well as time-consuming. Moreover, the captions of the LAION dataset exhibit various issues, such as text-image misalignment, deficient descriptions, and infrequent vocabulary.
To efficiently facilitate this process, the researchers behind Pixart-α construct a dataset consisting of precise text-image pairs with high concept density. To generate captions with high information density, the researchers behind Pixart-α leverage the state-of-the-art vision-language model LLaVA. Employing the prompt, “Describe this image and its style in a very detailed manner,” the researchers significantly improved the quality of captions.
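
To get a feel for this captioning step, here is a minimal sketch of how information-dense captions could be produced with an open LLaVA checkpoint through 🤗 Transformers. The checkpoint name, the chat template, and the image path below are assumptions made for illustration; the paper's exact captioning pipeline may differ.

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: an open LLaVA 1.5 checkpoint; the paper's exact setup may differ
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The same instruction used in the paper to elicit detailed captions
prompt = (
    "USER: <image>\n"
    "Describe this image and its style in a very detailed manner ASSISTANT:"
)
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))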

LAION raw captions vs. LLaVA refined captions. LLaVA provides high-information-density captions that aid the model in grasping more concepts per iteration and boost text-image alignment efficiency. Source: Figure 3 from the paper.

However, it is worth noting that the LAION dataset predominantly comprises simplistic product previews from shopping websites, which are not ideal for training text-to-image models that seek diversity in object combinations. Hence, the researchers turned to the SA-1B dataset, which was originally built for segmentation tasks but features imagery rich in diverse objects. By applying LLaVA to the SA-1B dataset, the researchers acquired high-quality text-image pairs characterized by a high concept density.

Examples from the SAM dataset using LLaVA-produced labels. The detailed image descriptions in LLaVA captions can aid the model to grasp more concepts per iteration and boost text-image alignment efficiency. Source: Figure 11 from the paper.


Stage-3: High-resolution and Aesthetic Image Generation

In the third stage, the model is fine-tuned using high-quality aesthetic data for high-resolution image generation. Remarkably, it is observed that the adaptation process in this stage converges significantly faster, primarily owing to the strong prior knowledge established in the preceding stages.

The Efficient Text-to-Image Transformer

Pixart-α adopts the Diffusion Transformer (DiT) as its base architecture and tailors the transformer blocks to the demands of text-to-image generation: each block gains a multi-head cross-attention layer that injects the text embeddings produced by a T5 text encoder, and DiT's per-block adaptive layer norm (adaLN) modules are replaced with a single shared adaLN-single module plus small per-block learnable embeddings, which substantially reduces the parameter count.

Model architecture of Pixart-α. Source: Figure 4 from the paper.
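
For intuition, here is a heavily simplified PyTorch sketch of such a block: cross-attention over the T5 text tokens plus a shared adaLN-single modulation with small per-block learnable offsets. The class name and shapes are made up for illustration; this is an approximation of the design described above, not the official implementation.

import torch
import torch.nn as nn

class PixArtStyleBlock(nn.Module):
    # Simplified sketch of a PixArt-α-style block; not the official implementation
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Per-block learnable offsets added to the shared adaLN-single modulation
        self.scale_shift_table = nn.Parameter(torch.zeros(6, dim))

    def forward(self, x, text_tokens, t_modulation):
        # t_modulation: (batch, 6, dim), produced once per timestep by a single
        # shared MLP (the "adaLN-single" idea) instead of one adaLN per block
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
            self.scale_shift_table[None] + t_modulation
        ).chunk(6, dim=1)
        h = self.norm1(x) * (1 + scale_msa) + shift_msa
        x = x + gate_msa * self.self_attn(h, h, h)[0]
        # Cross-attention injects the T5 text embeddings into the image tokens
        x = x + self.cross_attn(x, text_tokens, text_tokens)[0]
        h = self.norm2(x) * (1 + scale_mlp) + shift_mlp
        x = x + gate_mlp * self.mlp(h)
        return x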


Generating Images with Pixart-α

We can use Pixart-α to generate images easily using the PixArtAlphaPipeline from 🤗 HuggingFace Diffusers. We will also use the Weights & Biases autologger for Diffusers to automatically log our generations and all experiment configurations so that they are reproducible and easy to share.
# Install all the dependencies
!pip install diffusers accelerate transformers ftfy sentencepiece wandb

import torch
from diffusers import PixArtAlphaPipeline
from wandb.integration.diffusers import autolog

# Load the pre-trained checkpoint from the HuggingFace Hub into the PixArtAlphaPipeline
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
)

# Offloading the weights to the CPU and only loading them onto the GPU when
# performing the forward pass can also save memory.
pipe.enable_model_cpu_offload()

# Call the WandB autologger for Diffusers
autolog(init=dict(project="pixart-alpha"))

# Make the experiment reproducible by controlling randomness.
# The seed will be automatically logged to WandB.
generator = torch.Generator(device="cuda").manual_seed(42)

# Generate the images by calling the PixArtAlphaPipeline
images = pipe(
    prompt="A dog that has been meditating all the time",
    negative_prompt="",
    height=1024,
    width=1024,
    generator=generator,
).images
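
Once you're done generating, you can explicitly finish the run so that everything the autologger captured (the prompt, the generated images, and the pipeline configuration) is synced to your Weights & Biases workspace:

import wandb

# End the W&B run started by the autologger
wandb.finish()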

Here are some results!

Images generated using Pixart-α


Comparisons with Stable Diffusion XL

Let's take a look at some examples of images generated by both Pixart-α and Stable Diffusion XL Base 1.0 using the same prompt at a resolution of 1024 × 1024 pixels. For these generations, we did not use negative prompts.
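
Before scrolling through the panel, here is a rough sketch of how a comparison like this can be set up: load both pipelines, fix the same seed for each call, and let the Weights & Biases autologger record every generation in one project. The project name and prompt below are placeholders; the checkpoints are the one used throughout this report and the official SDXL Base 1.0 release.

import torch
from diffusers import PixArtAlphaPipeline, StableDiffusionXLPipeline
from wandb.integration.diffusers import autolog

autolog(init=dict(project="pixart-vs-sdxl"))  # placeholder project name

pixart = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
)
pixart.enable_model_cpu_offload()

sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
sdxl.enable_model_cpu_offload()

prompt = "A dog that has been meditating all the time"  # placeholder prompt

# Use the same seed for both pipelines so the comparison is as fair as possible
for pipe in (pixart, sdxl):
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe(
        prompt=prompt, height=1024, width=1024, generator=generator
    ).images[0]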


Examples of images generated by Pixart-α and Stable Diffusion XL Base 1.0 with the same prompt.

Let's pause scrolling through the results for a moment and reflect on a few observations:
  • Images generated by Pixart-α tend to have a more vibrant and sharper color palette, compared to the ones generated by SDXL. For example, check rows 1, 5, 8, 17, 18, 22, and 25 from the table above.
  • Pixart-α exhibits much stronger text-image alignment compared to SDXL. For example, check rows 1, 5, 10, 13, 23, and 24 from the table above.
  • Pixart-α can produce images that are much more detailed, vibrant, and expressive with very short prompts, compared to SDXL. For example, check row 18.
Let's look at a few more images generated by Pixart-α and SDXL:

More examples of images generated by Pixart-α and Stable Diffusion XL Base 1.0 with the same prompt.



More Text-Image Alignment Challenges

Let's put text-image alignment to a few more tests and observe how accurately certain phrases are rendered by Pixart-α compared to SDXL:


Let's observe how well Pixart-α is able to align certain phrases in the prompts with the corresponding generated images, compared to SDXL.


Manipulating Image Styles using the Prompt

Let's now look at the ability of Pixart-α to directly manipulate image style with text prompts. In the following panel, we generate five outputs, using style phrases in the prompt to control how the objects are rendered.


The ability of Pixart-α to manipulate image styles using the prompt.
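
As a rough sketch of how a panel like the one above can be produced, you can keep the subject fixed and sweep over a list of style phrases appended to the prompt, assuming pipe and the autologger from the earlier snippet are still in scope. The subject and styles below are placeholders, not the exact prompts used in the panel.

# Reuse `pipe` and the autologger from the earlier snippet
subject = "a serene lakeside cabin at dawn"  # placeholder subject
styles = ["oil painting", "pixel art", "watercolor", "cyberpunk", "ukiyo-e print"]

for style in styles:
    # Fix the seed so only the style phrase changes between generations
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe(
        prompt=f"{subject}, in the style of {style}",
        height=1024,
        width=1024,
        generator=generator,
    ).images[0]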


Conclusion

  • In this article, we took a look at the recently released open-source text-to-image generation model Pixart-α.
  • Pixart-α was trained by researchers at Huawei Noah’s Ark Lab at a fraction of the cost of existing text-to-image generation models like Stable Diffusion 1.5, while rivaling SoTA models like Stable Diffusion XL in text-image alignment and the fidelity of generated images.
  • We briefly explored the current challenges of training text-to-image models, and how the researchers behind Pixart-α were able to optimize the process to cut down the training time.
  • We also briefly explored the base architecture of Pixart-α, derived from the Diffusion Transformer (DiT).
  • We explored how to easily generate images with Pixart-α using the PixArtAlphaPipeline from 🤗 HuggingFace Diffusers, and how to use the Weights & Biases autologger for Diffusers to automatically log our generations and all experiment configurations so that they are reproducible and easy to share.
  • We also explored the image generation and text-image alignment capabilities of Pixart-α and compared it with Stable Diffusion XL on the same prompts across diverse scenarios.

